# Summary

*Figure: Optimal Policy and State-Value Function in Blackjack (Sutton and Barto, 2017)*
### Monte Carlo Methods
- In Monte Carlo (MC) methods, even though the underlying problem involves a great deal of randomness, we can infer useful information that we can trust just by collecting a large number of samples.
- The equiprobable random policy is the stochastic policy in which, from each state, the agent selects uniformly at random from the set of available actions, so each action is selected with equal probability.
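A minimal sketch of sampling one action from an equiprobable random policy (the action count nA = 4 is an arbitrary example):

```python
import numpy as np

nA = 4                                             # number of available actions (arbitrary example)
probs = np.ones(nA) / nA                           # each action has probability 1/nA
action = np.random.choice(np.arange(nA), p=probs)  # sample one action from the policy
```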
### MC Prediction
- Algorithms that solve the prediction problem determine the value function v_\pi (or q_\pi) corresponding to a policy \pi.
- When working with finite MDPs, we can estimate the action-value function q_\pi corresponding to a policy \pi in a table known as a Q-table. This table has one row for each state and one column for each action. The entry in the s-th row and a-th column contains the agent's estimate of the expected return if the agent starts in state s, selects action a, and henceforth follows the policy \pi (see the sketch after this list).
- Each occurrence of the state-action pair s,a (s\in\mathcal{S},a\in\mathcal{A}) in an episode is called a visit to s,a.
- There are two types of MC prediction methods (for estimating q_\pi):
  - First-visit MC estimates q_\pi(s,a) as the average of the returns following only first visits to s,a (that is, it ignores returns that are associated with later visits).
  - Every-visit MC estimates q_\pi(s,a) as the average of the returns following all visits to s,a.
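Below is a minimal sketch of first-visit MC prediction for q_\pi. It assumes a Gym-style environment `env` (with `env.action_space.n`) and a hypothetical helper `generate_episode(env, policy)` that returns a list of (state, action, reward) tuples; these names are illustrative assumptions, not part of the summary above.

```python
from collections import defaultdict
import numpy as np

def first_visit_mc_prediction(env, policy, generate_episode, num_episodes, gamma=1.0):
    """Estimate q_pi with first-visit MC: average the returns that follow
    only the first visit to each (state, action) pair in each episode."""
    returns_sum = defaultdict(lambda: np.zeros(env.action_space.n))
    visit_count = defaultdict(lambda: np.zeros(env.action_space.n))
    Q = defaultdict(lambda: np.zeros(env.action_space.n))

    for _ in range(num_episodes):
        episode = generate_episode(env, policy)   # [(S_0, A_0, R_1), (S_1, A_1, R_2), ...]
        states, actions, rewards = zip(*episode)
        discounts = np.array([gamma ** i for i in range(len(rewards))])
        first_visits_seen = set()
        for t, (s, a) in enumerate(zip(states, actions)):
            if (s, a) in first_visits_seen:       # ignore returns that follow later visits
                continue
            first_visits_seen.add((s, a))
            G = np.sum(rewards[t:] * discounts[:len(rewards) - t])  # return following time t
            returns_sum[s][a] += G
            visit_count[s][a] += 1.0
            Q[s][a] = returns_sum[s][a] / visit_count[s][a]
    return Q
```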
### Greedy Policies
- A policy is greedy with respect to an action-value function estimate Q if, for every state s\in\mathcal{S}, it is guaranteed to select an action a\in\mathcal{A}(s) such that a \in \arg\max_{a'\in\mathcal{A}(s)}Q(s,a'). (It is common to refer to the selected action as the greedy action.)
- In the case of a finite MDP, the action-value function estimate is represented in a Q-table. Then, to get the greedy action(s), for each row in the table, we need only select the action (or actions) corresponding to the column(s) with the maximum entry in that row.
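For example, if the Q-table is stored as a dictionary mapping each state to a NumPy array of action-value estimates (as in the sketch above), the greedy action for a state can be read off with `argmax` (ties are broken by taking the first maximizer):

```python
import numpy as np

def greedy_action(Q, state):
    """Return a greedy action for `state` with respect to the Q-table `Q`,
    where Q maps each state to an array of action-value estimates."""
    return int(np.argmax(Q[state]))
```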
### Epsilon-Greedy Policies
- A policy is \epsilon-greedy with respect to an action-value function estimate Q if for every state s\in\mathcal{S},
  - with probability 1-\epsilon, the agent selects the greedy action, and
  - with probability \epsilon, the agent selects an action uniformly at random from the set of available (non-greedy AND greedy) actions.
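A common way to build the \epsilon-greedy action probabilities from one row of the Q-table is sketched below; the Q-values in the usage example are arbitrary.

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    """Return epsilon-greedy probabilities over the actions of one state:
    every action gets epsilon / nA, and the greedy action additionally gets
    1 - epsilon, so the probabilities sum to 1."""
    nA = len(q_values)
    probs = np.full(nA, epsilon / nA)
    probs[np.argmax(q_values)] += 1.0 - epsilon
    return probs

# usage: sample an action for one state (the Q-values are arbitrary examples)
q_values = np.array([1.0, 0.5, 2.0])
action = np.random.choice(len(q_values), p=epsilon_greedy_probs(q_values, epsilon=0.1))
```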
### MC Control
- Algorithms designed to solve the control problem determine the optimal policy \pi_* from interaction with the environment.
- The Monte Carlo control method uses alternating rounds of policy evaluation and improvement to recover the optimal policy.
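Schematically, the loop below alternates the two steps. The helpers `generate_episode_from_Q` (an episode generator that follows the \epsilon-greedy policy derived from Q) and `update_Q` (the policy-evaluation update, e.g. the constant-\alpha update sketched at the end of this summary) are hypothetical names used only for this sketch, as are the \epsilon schedule defaults.

```python
import numpy as np

def mc_control_loop(generate_episode_from_Q, update_Q, Q, num_episodes,
                    eps_start=1.0, eps_decay=0.99999, eps_min=0.05):
    """Alternate policy evaluation and policy improvement.

    generate_episode_from_Q(Q, epsilon): follow the epsilon-greedy policy
        derived from Q and return a list of (state, action, reward) tuples.
    update_Q(episode, Q): policy evaluation -- update the Q-table from the episode."""
    epsilon = eps_start
    for _ in range(num_episodes):
        epsilon = max(epsilon * eps_decay, eps_min)  # gradually shift exploration toward exploitation
        # policy improvement: act epsilon-greedily with respect to the current Q
        episode = generate_episode_from_Q(Q, epsilon)
        # policy evaluation: update the Q-table using the sampled episode
        Q = update_Q(episode, Q)
    # the final policy estimate is greedy with respect to the final Q-table
    policy = {s: int(np.argmax(q)) for s, q in Q.items()}
    return policy, Q
```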
### Exploration vs. Exploitation
- All reinforcement learning agents face the Exploration-Exploitation Dilemma, where they must find a way to balance the drive to behave optimally based on their current knowledge (exploitation) and the need to acquire knowledge to attain better judgment (exploration).
- In order for MC control to converge to the optimal policy, the Greedy in the Limit with Infinite Exploration (GLIE) conditions must be met:
  - every state-action pair s, a (for all s\in\mathcal{S} and a\in\mathcal{A}(s)) is visited infinitely many times, and
  - the policy converges to a policy that is greedy with respect to the action-value function estimate Q.
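One standard way to satisfy both conditions with an \epsilon-greedy policy is the schedule \epsilon_i = 1/i, where i is the episode number: \epsilon_i stays positive for every episode (infinite exploration) and decays to zero (greedy in the limit). A minimal sketch:

```python
def glie_epsilon(i_episode):
    """GLIE-compatible schedule epsilon_i = 1/i for episode i >= 1:
    positive in every episode, and decays to 0 in the limit."""
    return 1.0 / i_episode
```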
### Incremental Mean
- (In this concept, we amended the policy evaluation step to update the Q-table after every episode of interaction.)
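A minimal sketch of that per-episode update, written as an incremental mean, where N(s,a) is the number of visits to the pair s,a so far and G is the return that followed the most recent visit:

Q(s,a) \leftarrow Q(s,a) + \frac{1}{N(s,a)} \left( G - Q(s,a) \right)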
### Constant-alpha
- (In this concept, we derived the algorithm for constant-\alpha MC control, which uses a constant step-size parameter \alpha.)
- The step-size parameter \alpha must satisfy 0 < \alpha \leq 1. Higher values of \alpha will result in faster learning, but values of \alpha that are too high can prevent MC control from converging to \pi_*.
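A minimal sketch of the constant-\alpha evaluation step, which applies Q(s,a) \leftarrow Q(s,a) + \alpha (G - Q(s,a)) for every visit in an episode. It assumes Q is a defaultdict mapping each state to a NumPy array of action values, as in the earlier sketches, and the default value of \alpha is an arbitrary example; this function could play the role of the `update_Q` helper in the MC control sketch above.

```python
import numpy as np

def update_Q_constant_alpha(episode, Q, alpha=0.02, gamma=1.0):
    """Constant-alpha MC update: for every visit to (s, a) in the episode,
    nudge Q(s, a) toward the observed return G by a fixed step size alpha."""
    states, actions, rewards = zip(*episode)
    discounts = np.array([gamma ** i for i in range(len(rewards))])
    for t, (s, a) in enumerate(zip(states, actions)):
        G = np.sum(rewards[t:] * discounts[:len(rewards) - t])  # return following time t
        Q[s][a] += alpha * (G - Q[s][a])
    return Q
```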